dontmerge: support talkie 13b #508
Draft
georgewhewell wants to merge 3 commits into hellas-ai:master from
Conversation
`get_model_files` and `get_model_chat_template` now treat the model identifier as a local directory if it's an existing path on disk; that directory must look like a HuggingFace snapshot (config.json, tokenizer.json, tokenizer_config.json, and either model.safetensors or model.safetensors.index.json + shards). Otherwise the existing HF hub download path is used unchanged.
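A minimal Python sketch of that directory check, with a hypothetical helper name (the real loaders live in catgrad-llm; this only shows the shape of the test):

```python
from pathlib import Path

REQUIRED = ("config.json", "tokenizer.json", "tokenizer_config.json")

def looks_like_hf_snapshot(model_id: str) -> bool:
    """Hypothetical: does `model_id` name a local HF-style snapshot dir?"""
    d = Path(model_id)
    if not d.is_dir() or not all((d / f).is_file() for f in REQUIRED):
        return False
    # Weights: a single safetensors file, or an index plus shards
    # (shard glob assumes the usual HF model-NNNNN-of-NNNNN naming).
    return (d / "model.safetensors").is_file() or (
        (d / "model.safetensors.index.json").is_file()
        and any(d.glob("model-*.safetensors"))
    )
```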
Talkie is a 40-layer/40-head decoder-only transformer (talkie-lm.com,
github.com/talkie-lm/talkie) with the standard Llama backbone plus four
small departures, all expressible with existing catgrad operators and
sketched in code after this list:
1. RMSNorm everywhere is unweighted (F.rms_norm with no gamma),
including a norm immediately after the embedding.
2. QK-norm — RMSNorm is applied to Q and K after RoPE.
3. Per-head and per-layer learned gains — head_gain ([H]) on Q after
QK-norm, and scalar attn_gain / mlp_gain / embed_skip on the
residual branches.
4. Embedding-skip residual — the post-input-norm activations are
threaded through every block as e_x and added back via a learned
scalar.
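A rough PyTorch sketch of the four departures in one place; the exact placement of the scalar gains is an assumption read off the list above, not the reference implementation:

```python
import torch
import torch.nn.functional as F

def rms_norm(x: torch.Tensor) -> torch.Tensor:
    # (1) Unweighted everywhere: F.rms_norm with no learned gamma.
    return F.rms_norm(x, (x.shape[-1],))

def qk_norm_and_gain(q, k, head_gain):
    # q, k: [B, H, T, hd], already rotated by RoPE.
    q, k = rms_norm(q), rms_norm(k)           # (2) QK-norm after RoPE
    q = q * head_gain.view(1, -1, 1, 1)       # (3) per-head gain, shape [H]
    return q, k

def block(x, e_x, attn_fn, mlp_fn, attn_gain, mlp_gain, embed_skip):
    # attn_fn / mlp_fn stand in for the usual Llama sublayers.
    x = x + attn_gain * attn_fn(rms_norm(x))  # (3) scalar branch gains
    x = x + mlp_gain * mlp_fn(rms_norm(x))
    x = x + embed_skip * e_x                  # (4) embedding-skip: e_x is the
    return x                                  #     post-input-norm activation
```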
The lm_head is an untied [V, D] parameter (not a Linear) scaled by a
learned scalar (lm_head_gain.w_g) before the final matmul. Talkie's
RoPE uses the opposite sin convention from catgrad's default; we negate
cache.sin once after init to match.
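In PyTorch terms (names approximate; `w_g` is the checkpoint scalar):

```python
import torch

def lm_logits(hidden: torch.Tensor, lm_head: torch.Tensor, w_g: torch.Tensor):
    # lm_head is a bare [V, D] parameter, not a Linear; activations are
    # scaled by the learned scalar before the final matmul.
    return (hidden * w_g) @ lm_head.T         # [B, T, D] -> [B, T, V]

# One-time sign flip after building the RoPE cache, matching Talkie's
# opposite sin convention (illustrative; the actual fix is in catgrad):
# cache.sin = -cache.sin
```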
Architecture string: TalkieForCausalLM. End-to-end inference reproduces
the upstream PyTorch reference byte-for-byte at greedy argmax for short
sequences in bf16; on longer sequences the cross-implementation bf16
noise floor (Metal vs CPU) flips one borderline argmax per ~40 tokens
on some prompts. Test harness in scripts/compare/talkie_compare.sh.
Helpers:
- scripts/convert_talkie.py: pickle -> safetensors + tokenizer + config
- scripts/llm_talkie.py: greedy-argmax PyTorch reference
- scripts/compare/talkie_compare.sh: token-level stability matrix
The decoder stack now reads from `model.embed.weight`,
`model.blocks.{i}.…` — matching the HF port at
`lewtun/talkie-1930-13b-it-hf` (`TalkieForCausalLM` with `self.model =
TalkieModel(…)` and `lm_head`/`lm_head_gain.w_g` at the root).
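Roughly the module tree those keys imply (a sketch of the layout as described, not the port's actual source):

```python
import torch
import torch.nn as nn

class TalkieModel(nn.Module):
    # Stub for the decoder stack: keys model.embed.weight, model.blocks.{i}.*
    def __init__(self, config):
        super().__init__()
        self.embed = nn.Embedding(config.vocab_size, config.hidden_size)
        self.blocks = nn.ModuleList()  # decoder blocks elided here

class TalkieForCausalLM(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.model = TalkieModel(config)
        # Untied head and its scalar gain sit at the root of the state dict:
        self.lm_head = nn.Parameter(torch.empty(config.vocab_size, config.hidden_size))
        self.lm_head_gain = nn.ParameterDict({"w_g": nn.Parameter(torch.ones(()))})
```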
That repo includes a full HF-format checkpoint plus a `tokenizer.json`
already in HF tokenizers form, so our pickle→safetensors converter and
greedy-argmax reference are no longer needed:
- rm catgrad-llm/scripts/convert_talkie.py
- rm catgrad-llm/scripts/llm_talkie.py
- rm catgrad-llm/scripts/compare/talkie_compare.sh
End-to-end run:

```sh
./target/release/examples/llama -m lewtun/talkie-1930-13b-it-hf \
  -k -s 60 --dtype bf16 -p "Write a short poem about the wireless telegraph."
```